On counting position weight matrix matches in a sequence, with application to discriminative motif finding
نویسنده
چکیده
MOTIVATION AND RESULTS The position weight matrix (PWM) is a popular method to model transcription factor binding sites. A fundamental problem in cis-regulatory analysis is to "count" the occurrences of a PWM in a DNA sequence. We propose a novel probabilistic score to solve this problem of counting PWM occurrences. The proposed score has two important properties: (1) It gives appropriate weights to both strong and weak occurrences of the PWM, without using thresholds. (2) For any given PWM, this score can be computed while allowing for occurrences of other, a priori known PWMs, in a statistically sound framework. Additionally, the score is efficiently differentiable with respect to the PWM parameters, which has important consequences for designing search algorithms. The second problem we address is to find, ab initio, PWMs that have high counts in one set of sequences, and low counts in another. We develop a novel algorithm to solve this "discriminative motif-finding problem", using the proposed score for counting a PWM in the sequences. The algorithm is a local search technique that exploits derivative information on an objective function to enhance speed and performance. It is extensively tested on synthetic data, and shown to perform better than other discriminative as well as non-discriminative PWM finding algorithms. It is then applied to cis-regulatory modules involved in development of the fruitfly embryo, to elicit known and novel motifs. We finally use the algorithm on genes predictive of social behavior in the honey bee, and find interesting motifs. AVAILABILITY The program is available upon request from the author.
منابع مشابه
T-Reg Comparator: an analysis tool for the comparison of position weight matrices
T-Reg Comparator is a novel software tool designed to support research into transcriptional regulation. Sequence motifs representing transcription factor binding sites are usually encoded as position weight matrices. The user inputs a set of such weight matrices or binding site sequences and our program matches them against the T-Reg database, which is presently built on data from the Transfac ...
متن کامل3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs
Computational methods such as sequence alignment and motif construction are useful in grouping related proteins into families, as well as helping to annotate new proteins of unknown function. These methods identify conserved amino acids in protein sequences, but cannot determine the specific functional or structural roles of conserved amino acids without additional study. In this work, we prese...
متن کاملLecture 7 Statistical Models January 27 , 1998 Lecturer
In previous lectures we have discussed using the general expectation maximization algorithm to find an optimal weight matrix that indicated the probability for finding a given symbol at a given location in a motif. The weight matrix model has two main limitations in its ability to describe a family of related sequences, protein or otherwise. First of all, the weight matrix only allows compariso...
متن کاملHIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences
MOTIVATION Identification of motifs in biological sequences is a challenging problem because such motifs are often short, degenerate, and may contain gaps. Most algorithms that have been developed for motif-finding use the expectation-maximization (EM) algorithm iteratively. Although EM algorithms can converge quickly, they depend strongly on initialization parameters and can converge to local ...
متن کاملA Semi-Supervised Ensemble Learning Method for Finding Discriminative Motifs and its Application
Finding discriminative motifs has recently received much attention in biomedicine as such motifs allow us to characterize in distinguishing two different classes of sequences. It is common in biomedical applications that the quantity of labeled sequences is very limited while a large number of unlabeled sequences is usually available. The current methods of discriminative motif finding are powe...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Bioinformatics
دوره 22 14 شماره
صفحات -
تاریخ انتشار 2006